gh-142183: Change data stack to use a resizable array#148681
dpdani wants to merge 5 commits into python:main
Conversation
|
@pablogsal can you take a look this week? 🙏 |
I can try but I have some other PRs first in my review queue :( |
Thanks for working on this. I am worried about a couple of consequences that I think we should account for before continuing with this:
A minor concern is that this seems to change Python stack memory from “roughly current depth” usage to “high-water mark for the lifetime of the thread”. In the old chunked implementation, deep recursion allocated additional 16 KiB chunks and then released most of them while unwinding. In this version, resize_stack() keeps previous stack chunks linked from stack_chunk_list, and _PyThreadState_PopFrame() only moves stack_top; the chunks are not freed until the thread state is deleted.
This doesn't seem to be a lot so I am not too worried.
My bigger concern is the profiler/debugger consequences. The old stack chunk layout allowed _remote_debugging/external unwinders to bulk-copy stack chunks cheaply. With this change, the active frame chain can span older chunks while only the newest chunk is copied in the new _remote_debugging path, so older frames fall back to individual remote reads. For a 1000-frame stack I measured:
- old no-cache unwinding: 4 memory reads, ~1.2 KiB read
- new no-cache unwinding: 966 memory reads, ~85.8 KiB read
Tachyon is probably mostly insulated because it uses frame caching, but first samples, cache-disabled paths, fallback paths, and external tools still care. Other profilers, such as Austin, currently hard-code the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.
Given this and the potential gains I do not find the tradeoff very convincing....
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
chunk_addr = GET_MEMBER(uintptr_t, chunks[count].local_copy, offsetof(_PyStackChunk, previous));
count++;
// Process this chunk
if (process_single_stack_chunk(unwinder, chunk_addr, &chunks[count]) < 0) {
Unless I’m missing something, stopping after a single chunk here looks like a large perf regression for the profiler. The runtime still has a linked chunk chain via stack_chunk_list -> previous, but this now only copies the newest chunk. If the active frame chain spans older chunks, find_frame_in_chunks() misses those frames and we fall back to parse_frame_object(), which does one remote memory read per frame.
|
Unfortunately, the more I think about it the less I like it: this model is harder to reason about than the previous linked-list model, and I think that matters because it creates a very easy footgun: the current chunk looks like the current stack backing store, but it is not. Existing live frames may still be in older chunks, while newer frames are in the newest chunk at a matching logical offset. So any code that treats the current chunk as the sole backing store for live frames will be wrong.

The risk is not just to external profilers. This makes future runtime/debugger code more fragile, because pointer validity and frame ownership now depend on searching the whole chunk chain, not on checking the current chunk. |
|
@pablogsal
There seems to be a misunderstanding here. No two frames will have the same offset.
The stack is a linked list of frames, not chunks, and some frames (generator and coroutine frames) aren't in chunks at all, so tools already need to handle pointers outside of the current chunk. Overall, I don't see how this really changes anything for an out-of-process profiler: copy all the chunks, then traverse the stack. Also, note that the lower, unused part of the current chunk in a stack with multiple chunks will be untouched, so a profiler should be able to detect when it needs to cross to another chunk.
That is the trade-off that tools and libraries make: either they use stable APIs/ABIs, or they probe CPython internals. If they do the latter, they will need updating every release. @P403n1x87 would this be a problem for you? |
The issue I was pointing at is that the current code copies only the newest chunk. The fix here is to restore eager copying of the full chunk chain. |
Also, you’re right that “same logical offset” was poor wording. I meant that the new chunk starts allocating at the old stack depth, leaving the lower part of the new chunk unused; not that two frames have the same offset. |
|
Here is a repro: In main we can see: with this PR: Notice |
Yes, that's my bad. I made the smallest changes possible to the remote debugging module to get the PR working, but didn't look into further improvements. Maybe that can be done in a follow-up PR by people more knowledgeable about the module? Or would you consider that a blocker? |
This is a blocker. This PR adds a regression and that's not acceptable. |
Force-pushed from 8115b28 to 239e2eb
|
@dpdani I pushed a fix for the concrete regression discussed above. I also added a regression test to make sure deep stacks are resolved from copied chunks rather than falling back to parsing frames individually from remote memory.
|
That said, I still think this solution is very confusing and too complex. It is harder to reason about where frames live, which chunks are relevant, and what invariants the implementation can rely on. I am worried this makes future changes easier to get subtly wrong. |
Force-pushed from 239e2eb to 99e9e44
|
I investigated an alternative implementation in #149097: instead of replacing the stack with the resizable-array model, it keeps the current chunked-stack invariants and extends the existing one-chunk cache into a small bounded per-thread cache. I think this is a better direction because it fixes the allocator-thrashing issue without changing where live frames can reside and without changing the chunk layout that external tools probe.

The cache is intentionally bounded to a small number of chunks per thread. I checked the original repro and it no longer shows per-branch mmap/munmap churn. |
|
To fix the really bad behavior when we are constantly crossing the boundary between chunks, a single cached chunk is sufficient. See #145828. However, calls that cross chunk boundaries will still prevent specialization and JIT compilation, as the code keeps having to hit the slow path of not having enough space in the current chunk. Also, this approach allows us to use smaller initial chunks, as we don't need to worry about the cost of crossing chunk boundaries.
I don't think so, I think it is fairly elegant. But it is a bit subtle. |
|
@pablogsal would it help if the lower, unused part of the chunk were zeroed-out? |
Yes I think with more docs we can make this much more palatable. I want to say that I may be in the minority here so it would be great to get a 2nd opinion from @Fidget-Spinner regarding the complexity argument.
This is how I understand the tradeoff: The benefit of this approach over the chunk-cache approach is that it addresses a different problem. A cached chunk fixes the allocator-thrashing case, but a hot call near a chunk boundary can still keep failing The part that still makes me prefer the cache approach is the change in stack/profiler invariants. With the cache approach, inactive chunks are detached, and the active On zeroing the lower unused part: I think it could help as a defensive/debugging aid, but I would not rely on it as the correctness mechanism. A profiler/unwinder should still follow It may still be worth doing if we want the unused prefix to be deterministic and make heuristic scans fail cleanly. The main caveat is cost: explicit zeroing can touch pages that would otherwise remain lazily zeroed by the OS, so I would want to avoid zeroing large regions unless we measure that it is negligible. Perhaps we can alleviate the fact that profilers will need to copy "old zeroed chunks" byt keeping a pointer to the start of the "new" chunks? I can try to profile a bit....maybe I am overreacting to this and the difference is negligible. Will investigate |
const intptr_t offset = ptr - start;
const intptr_t usable_size = (intptr_t)(chunk->size - _PY_STACK_CHUNK_OVERHEADS);
return offset >= 0 && offset < usable_size && start + offset == ptr;
Suggested change:
-const intptr_t offset = ptr - start;
-const intptr_t usable_size = (intptr_t)(chunk->size - _PY_STACK_CHUNK_OVERHEADS);
-return offset >= 0 && offset < usable_size && start + offset == ptr;
+const uintptr_t usable_size = (uintptr_t)(chunk->size - _PY_STACK_CHUNK_OVERHEADS);
+return ptr >= start && (uintptr_t)(ptr - start) < usable_size;
As I have learned while working on this PR, those kinds of pointer comparisons are undefined behaviour according to the C standard, because they do not belong to the same object. See Section 6.5.8, paragraph 5 (https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf).
So ptr >= start is undefined, but computing where the pointer would have been in the new allocation and checking for equality is ok.
As with any change of this nature, it is indeed a problem, but as you point out, it is something that maintainers of this kind of tool have to put up with anyway. For Austin specifically, the crucial aspect is being able to quickly work out all the memory chunks with frame data that need to be copied eagerly, to ensure as much data coherence as possible (Austin does not pause threads when unwinding the Python stack). Once that data has been moved over, we can work out what to do with it. |
This PR changes the implementation of the Python stack to use a resizable array. This avoids the problem of calls that frequently cause the datastack_top (now called stack_top) pointer to switch between allocations.

After resizing, previous array allocations are not immediately freed, because various bits around the VM may still point into them; they are instead freed along with the tstate.
During resizing, the previous contents of the stack are not copied into the new allocation; the memory of the previous allocation remains in use. As frames are subsequently popped and pushed, new frames will always reside on the new stack allocation.
Overall it results in a ±1% performance change (within the noise range), but it avoids degenerate cases for any number of frames. I am also told it would allow further optimizations in the JIT.